# Lab 26 - k-Nearest Neighbors classifier 2

We will continue using the Titanic training and test data from [Kaggle](https://www.kaggle.com/c/titanic) that we used in Labs 24 and 25.

First import the necessary libraries.

In [34]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

### Loading and cleaning the data

The code from Lab 25 for loading and cleaning the data is below.

In [55]:
train = pd.read_csv("../Data/train.csv")
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [56]:
# fill in missing age data
train["Age"] = train["Age"].fillna(train["Age"].median())

# fill in the missing embarked data
train["Embarked"] = train["Embarked"].fillna("S")

In [57]:
# create dummy variables for passenger class, sex, and embarked
train2 = pd.get_dummies(train, columns = ["Pclass","Sex","Embarked"], drop_first = True)
train2.head()

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,1,0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,0,1,1,0,1
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,0,0,0,0,0
2,3,1,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,0,1,0,0,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,0,0,0,0,1
4,5,0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,0,1,1,0,1


In [58]:
# remove the remaining qualitative columns
train2.drop("Cabin",axis = 1,inplace = True)
train2.drop("Name",axis = 1,inplace = True)
train2.drop("Ticket",axis = 1,inplace = True)

# we should also drop PassengerId, although we did not do so in the last lab
train2.drop("PassengerId",axis = 1, inplace = True)

In [59]:
# split the original training data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(train2.drop("Survived", axis=1), train2["Survived"], test_size=0.2)

In [60]:
# create a 3-nearest neighbor classifier
knn = KNeighborsClassifier(n_neighbors=3)
# fit the classifier to the training data
knn.fit(X_train, y_train)
# test and score the classifier on our test data (part of the original training data)
# notice this line corrects a mistake in Lab 25
knn.score(X_test, y_test)

0.72625698324022347
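
Because `train_test_split` shuffles the rows randomly, the score above will change from run to run. The sketch below shows one way to make the split repeatable by passing `random_state`. It uses a small synthetic data set rather than the Titanic data, and the helper name `split_and_score` is made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned features (an assumption, not the Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def split_and_score(seed):
    """Split with a fixed seed, fit a 3-nearest neighbor classifier, return accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_tr, y_tr)
    # score returns the fraction of test rows that were classified correctly
    return knn.score(X_te, y_te)

score_a = split_and_score(seed=42)
score_b = split_and_score(seed=42)
print(score_a, score_b)  # the same seed gives the same split, so the two scores match
```

With a fixed seed it is easier to compare different values of `n_neighbors`, since every classifier is scored on the same held-out rows.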

Now we are going to try running our classifier on the Kaggle test data and use all of our training data to fit the classifier.

First, load the test data from Kaggle into the dataframe `test`.

In [61]:
test = pd.read_csv("../Data/test.csv")
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


We have to process the test data in the same way as the training data, namely filling in the missing age and embarked data, creating the dummy variables for Pclass, Sex, and Embarked, and dropping the Cabin, Name, and Ticket columns.  Do this below, adding as many extra cells as you need.

In [62]:
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Embarked"] = test["Embarked"].fillna("S")

test2 = pd.get_dummies(test, columns = ["Pclass","Sex","Embarked"], drop_first = True)

test2.drop("Cabin",axis = 1,inplace = True)
test2.drop("Name",axis = 1,inplace = True)
test2.drop("Ticket",axis = 1,inplace = True)

test2.head()

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,892,34.5,0,0,7.8292,0,1,1,1,0
1,893,47.0,1,0,7.0,0,1,0,0,1
2,894,62.0,0,0,9.6875,1,0,1,1,0
3,895,27.0,0,0,8.6625,0,1,1,0,1
4,896,22.0,1,1,12.2875,0,1,0,0,1


Store the `PassengerId` column in a variable. We need this information for submitting our predictions to Kaggle, but we don't want to use it in making the predictions.

In [63]:
passengerId = test2["PassengerId"]
passengerId

0       892
1       893
2       894
3       895
4       896
5       897
6       898
7       899
8       900
9       901
10      902
11      903
12      904
13      905
14      906
15      907
16      908
17      909
18      910
19      911
20      912
21      913
22      914
23      915
24      916
25      917
26      918
27      919
28      920
29      921
       ... 
388    1280
389    1281
390    1282
391    1283
392    1284
393    1285
394    1286
395    1287
396    1288
397    1289
398    1290
399    1291
400    1292
401    1293
402    1294
403    1295
404    1296
405    1297
406    1298
407    1299
408    1300
409    1301
410    1302
411    1303
412    1304
413    1305
414    1306
415    1307
416    1308
417    1309
Name: PassengerId, Length: 418, dtype: int64

Drop the `PassengerId` column from the test data.

In [64]:
test2.drop("PassengerId",axis = 1, inplace = True)

In [65]:
test2.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,34.5,0,0,7.8292,0,1,1,1,0
1,47.0,1,0,7.0,0,1,0,0,1
2,62.0,0,0,9.6875,1,0,1,1,0
3,27.0,0,0,8.6625,0,1,1,0,1
4,22.0,1,1,12.2875,0,1,0,0,1


Next we split up our training data into the answer (the `Survived` column) and the input data (all other columns).

First, store the `Survived` column in the variable `y_train_kaggle`.

In [72]:
y_train_kaggle = train2["Survived"]
y_train_kaggle.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Next, drop the `Survived` column from the training data, and store the new data frame in the variable `X_train_kaggle`.

In [71]:
X_train_kaggle = train2.drop("Survived",axis = 1)
X_train_kaggle.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,22.0,1,0,7.25,0,1,1,0,1
1,38.0,1,0,71.2833,0,0,0,0,0
2,26.0,0,0,7.925,0,1,0,0,1
3,35.0,1,0,53.1,0,0,0,0,1
4,35.0,0,0,8.05,0,1,1,0,1
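
The classifier expects the test data to have the same columns, in the same order, as the data it was fit on. In this lab the two sets of columns already match, but `get_dummies` can produce different columns if a category is absent from one file. The sketch below shows one possible guard using `reindex`. The two small data frames are made up for the illustration.

```python
import pandas as pd

# Made-up dummy-encoded frames: the test frame lacks a column and uses a different order
X_train_toy = pd.DataFrame({"Age": [22.0, 38.0], "Sex_male": [1, 0], "Embarked_Q": [0, 1]})
X_test_toy = pd.DataFrame({"Sex_male": [1], "Age": [34.5]})

# Reorder the test columns to match training; a missing dummy column is filled with 0
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(list(X_test_aligned.columns) == list(X_train_toy.columns))
```

Filling with 0 is reasonable for a dummy column, because a category that never appears in the test data is false for every row.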


Create a new k-nearest neighbors object and fit it on the entire training data (`X_train_kaggle` and `y_train_kaggle`).

In [68]:
knn_kaggle = KNeighborsClassifier(n_neighbors=3)
knn_kaggle.fit(X_train_kaggle,y_train_kaggle)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

Now use our fitted classifier to make predictions on the test data (`test2`), and store them in the variable `y_pred`.

In [73]:
y_pred = knn_kaggle.predict(test2)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The error message says something about NaN. Could there be missing data (NaN) in a different column of the test data? Use `describe` to see if this is the case.

In [74]:
test2.describe()

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
count,418.0,418.0,418.0,417.0,418.0,418.0,418.0,418.0,418.0
mean,29.599282,0.447368,0.392344,35.627188,0.222488,0.521531,0.636364,0.110048,0.645933
std,12.70377,0.89676,0.981429,55.907576,0.416416,0.500135,0.481622,0.313324,0.478803
min,0.17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,23.0,0.0,0.0,7.8958,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,14.4542,0.0,1.0,1.0,0.0,1.0
75%,35.75,1.0,0.0,31.5,0.0,1.0,1.0,0.0,1.0
max,76.0,8.0,9.0,512.3292,1.0,1.0,1.0,1.0,1.0


The `Fare` column is missing one value: its count is 417, while every other column has 418. Fill it in with the median fare.

In [75]:
test2["Fare"] = test2["Fare"].fillna(test2["Fare"].median())
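
Reading the `count` row of `describe` works, but counting the missing values directly is often quicker. The sketch below uses `isna().sum()` on a small made-up data frame to show the check before and after the fill. The variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing fare
toy = pd.DataFrame({"Age": [34.5, 47.0, 62.0], "Fare": [7.8292, np.nan, 9.6875]})

# number of missing values in each column before the fill
missing_before = toy.isna().sum()

# fill in the missing fare with the median fare
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# number of missing values in each column after the fill
missing_after = toy.isna().sum()

print(missing_before)
print(missing_after)
```

Running the same check on `test2` before calling `predict` would show whether any column still contains NaN.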

Now try making the prediction again.

In [76]:
y_pred = knn_kaggle.predict(test2)

In [77]:
y_pred

array([0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1,

Finally, we want to write our predictions to a file, along with the passenger IDs.

In [78]:
df = pd.DataFrame(data = {"PassengerId":passengerId, "Survived":y_pred})

In [79]:
df.to_csv("test1.csv", index=False)
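
Before uploading, it can help to read the file back and confirm that it has a `PassengerId` column, a `Survived` column, and one row per test passenger. The sketch below runs that check on three made-up rows and writes to an in-memory buffer instead of `test1.csv`, so it does not touch any files.

```python
import io
import pandas as pd

# Made-up IDs and predictions standing in for passengerId and y_pred
ids = pd.Series([892, 893, 894], name="PassengerId")
preds = [0, 1, 0]
submission = pd.DataFrame({"PassengerId": ids, "Survived": preds})

# write without the row index, then read the result back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

print(check.columns.tolist())
print(len(check))
```

For the real file, `pd.read_csv("test1.csv")` should show the same two columns and 418 rows.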